17-cs02-analysis

Professor Shannon Ellis

3/9/23

CS02: Vaping Behaviors in American Youth (Analysis)

Q&A

Q: I am wondering when can we get the feedback for cs01 since it can help us to make cs02 better.
A: Great question! Grading is underway. We plan to finish Sunday night and release Monday of week 10, so that you can incorporate feedback for cs02.

Q: Why is a video mentioned in assignments for the Final Project?
A: All groups will have to do a presentation. This can be recorded (video) or presented live Thursday of finals week. See description here.

Q: The most confusing part of the lecture was why did we use the histogram plot instead of a bar plot when plotting the variables for case study 2.
A: Hmm….we used a bar plot and not a histogram when doing the plotting. For the skim outputs (if that’s what you’re asking), it will generate a histogram in line for any numeric type variables. If I’m misunderstanding your question, please do follow up!

Course Announcements

  • Lecture Participation survey “due” after class
  • Lab08 Due tomorrow (CS02 EDA)
  • CS02 due Monday of Finals week (3/20; repo + survey)
  • Final Project due Th of Finals week (3/23; repo + presentation + survey)

Notes:

  • Collaboration via git for group work, some options if you’re newer to git:
    • remember: pull before you make changes; push when done
    • work in separate Rmd and combine

Agenda

Packages

library(tidyverse)
library(tidymodels)
library(viridis)
library(srvyr)

Data

…will only work if you finished data set of notes (or open this week’s lab).

load("data/wrangled/wrangled_data_vaping.rda")

Questions

  1. How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
  2. How does e-cigarette use compare between males and females?
  3. What vaping brands and flavors appear to be used the most frequently?
  4. Is there a relationship between e-cigarette/vaping use and other tobacco use?

Q1: How has tobacco and e-cigarette/vaping use by American youths changed since 2015?

Tobacco Use

nyts_data |>
  group_by(year) |>
  summarize("Ever \n (any lifetime use)" = (mean(tobacco_ever, na.rm = TRUE) * 100),
            "Current \n (any past-30-day use)" = (mean(tobacco_current, na.rm = TRUE) * 100)) |>
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") |>
  ggplot(aes(x = year, y = `Percentage of students`)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 70, by = 10),
                     labels = seq(0, 70, by = 10),
                     limits = c(0, 70)) +
  # this adjusts the background style of the plot
  theme_linedraw() +
  labs(title = "How has tobacco use varied over the years?",
       y = "% of students") + 
  # this moves the legend to the bottom of the plot and removes the x axis title
  theme(legend.position = "bottom",
        axis.title.x = element_blank(), 
        text = element_text(size = 15),
        plot.title.position = "plot")

E-cig Use

nyts_data |>
  group_by(year) |>
  summarize("Ever \n (any lifetime use)" = (mean(ecig_ever, na.rm = TRUE) * 100),
            "Current \n (any past-30-day use)" = (mean(ecig_current, na.rm = TRUE) * 100)) |>
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") |>
  ggplot(aes(x = year, y = `Percentage of students`)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 60, by = 10),
                     labels = seq(0, 60, by = 10),
                     limits = c(0, 60)) +
  # this adjusts the background style of the plot
  theme_linedraw() +
  labs(title = "How has e-cigarette use varied over the years?",
       y = "% of students") +
  # this moves the legend to the bottom of the plot and removes the x axis title
  theme(legend.position = "bottom",
        axis.title.x = element_blank(),
        text = element_text(size = 15),
        plot.title.position = "plot")

Product Usage

v_colors =  viridis(5)[1:4]  #specify color palatte

nyts_data |>
  group_by(Group, year, n) |>
  summarize(group_count = n()) |>
  mutate("Percentage of students" = group_count / n * 100) |>
  ggplot(aes(x = year, y = `Percentage of students`, color = Group)) +
  geom_point(size = 2) +
  geom_line() +
  scale_color_manual(breaks = c("Neither", "Combination of products",
                                "Only e-cigarettes", "Only other products"),
                     values = v_colors) +
  theme_linedraw() +
  labs(x = "Year") +
  theme(text = element_text(size = 15),
        plot.title.position = "plot")

Q1: What’s the answer?

How has tobacco and e-cigarette/vaping use by American youths changed since 2015?

Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?

Q2: How does e-cigarette use compare between males and females?

E-cig usage by Sex

v_colors =  viridis(6)[c(3, 5)]

nyts_data |>
  filter(!is.na(Sex)) |>
  group_by(year, Sex) |>
  summarize("Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
            "Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)) |>
  pivot_longer(cols = "Ever \n (any lifetime use)":"Current \n (any past-30-day use)",
               names_to = "User",
               values_to = "Percentage of students") |>
  ggplot(aes(x = year, y = `Percentage of students`, color = Sex)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  labs(title = "How does e-cigarette usage compare between males and females?",
       subtitle = "Current and ever users by sex",
       y = "% of students") +
  theme(legend.position = "bottom",
        axis.title.x = element_blank(),
        text = element_text(size = 15),
        plot.title.position = "plot")

?## Q2: What’s the answer?

How does e-cigarette use compare between males and females?

Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?

Q3: What vaping brands and flavors appear to be used the most frequently?

Flavors

nyts_data |>
  filter(year != 2015) |>
  group_by(year) |>
  summarize(Menthol = (mean(menthol) * 100),
            `Clove or Spice` = (mean(clove_spice) * 100),
            Fruit = (mean(fruit) * 100),
            Chocolate = (mean(chocolate) * 100),
            `Alcoholic Drink` = (mean(alcoholic_drink) * 100),
            `Candy/Desserts/Sweets` = (mean(candy_dessert_sweets) * 100),
            Other = (mean(other) * 100)) |>
  pivot_longer(cols = -year, 
               names_to = "Flavor",
               values_to = "Percentage of students") |>
  rename(Year = year) |>
  ggplot(aes(y = `Percentage of students`,
             x = Year,
             fill = reorder(Flavor, `Percentage of students`))) +
  geom_bar(stat = "identity",
           position = "dodge",
           color = "black") +
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  guides(fill = guide_legend("Flavor")) +
  labs(title = "What flavors appear to be used the most frequently?",
       subtitle = "Flavors of tobacco products used in the past 30 days") + theme(text = element_text(size = 15))

Q3: What’s the answer?

What vaping brands and flavors appear to be used the most frequently?

Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?

Q4: Is there a relationship between e-cigarette/vaping use and other tobacco use?

Current & Ever

v_colors =  viridis(6)[c(1, 4)]

nyts_data |>
  group_by(year) |>
  summarize(
    "Cigarettes, Ever \n (any lifetime use)" = (mean(ECIGT, na.rm = TRUE) * 100),
    "E-cigarettes, Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
    "Cigarettes, Current \n (any past-30-day use)" = (mean(CCIGT, na.rm = TRUE) * 100),
    "E-cigarettes, Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)
  ) |>
  pivot_longer(cols = -year,
               names_to = "Category",
               values_to = "Percentage of students") |>
  separate(Category, into = c("Product", "User"), sep = ", ") |>
  ggplot(aes(
    x = year,
    y = `Percentage of students`,
    color = Product,
    linetype = User
  )) +
  geom_line() +
  geom_point(show.legend = FALSE, size = 2) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  labs(title = "How does e-cigarette use compare to cigarette use?",
       subtitle = "Current and ever users of e-cigarettes and cigarettes",
       y = "% of students") +
  theme(legend.position = "bottom",
        axis.title.x = element_blank(),
        text = element_text(size = 15),
        plot.title.position = "plot")

Q4: What’s the answer?

Is there a relationship between e-cigarette/vaping use and other tobacco use?

Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?

Going Deeper: Modelling

Survey Weighting

  • Data come from a survey
  • Not everyone who was sent a survey responded
  • Could over/under-sample from particular groups
  • Sample may no longer be representative
  • Requires: Survey weighting

Survey Weight Calculation

\[\frac{\text{actual proportion of group in the population}}{\text{ proportion of group in the respondents}}\]

Note: We do not need to calculate survey weights for our data; they were already supplied in the dataset, as described in the codebooks.

Variables included

  1. psu: People are sampled within strata; ensure that the responses are representative of the population of interest. Thus, often people first think about ensuring that surveys are conducted in a variety of geographical areas. This is often called the primary sampling unit or PSU. In this survey, the county where the student’s school was located was used as the PSU.
  2. stratum: A categorical variable that indicates subsets of the data that include respondents from different PSUs. In our case, strata are determined by the predominant minority in the PSU (Non-Hispanic Black or Hispanic), whether the PSU is urban or non-urban, and what percent of the students in the PSU fall into the predominant minority group. PSUs are allocated across the 16 possible strata according to the sampling scheme. These strata values allow estimates based on the survey responses to be calculated using different strata allowing for improved precision of the response estimates.
  3. finwgt: The survey weight which was calculated based on a variety of factors.

srvyr

srvyr::as_survey_design(): create a survey object with a specified survey design. Includes information about how the survey was conducted that can be taken into account in the analysis.

  1. strata - the variable(s) that defined strata in the data. (stratum)
  2. ids - cluster ids within the data. (psu)
  3. weight - which variable(s) are the survey weights. (finwgt)
  4. nest = TRUE - forces cluster ids (in this case the PSU) to be nested within the strata

Note: survey weights are specific to a single year of the survey result; will need different design objects for each year

function: surveyMeanA()

surveyMeanA <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear |>
    as_survey_design(strata = stratum,
                     ids = psu,
                     weight  = finwgt,
                     nest = TRUE) |>
    summarize(tobacco_ever = survey_mean(tobacco_ever,
                                         vartype = "ci",
                                         na.rm = TRUE),
              tobacco_current = survey_mean(tobacco_current,
                                            vartype = "ci",
                                            na.rm = TRUE))  |>
    mutate_all("*", 100) |>
    pivot_longer(everything(),
                 names_to = "Type",
                 values_to = "Percentage of students") |>
    mutate(Estimate = case_when(str_detect(Type, "_low") ~ "Lower",
                                str_detect(Type, "_upp") ~ "Upper",
                                TRUE ~ "Mean"),
           User = case_when(str_detect(Type, "ever") ~ "Ever",
                            str_detect(Type, "current") ~ "Current",
                            TRUE ~ "Mean"))}

❓ What’s this accomplishing?

Tobacco Use

nyts_data |>
  group_by(year) |>
  group_modify(~ surveyMeanA(.x)) |>
  dplyr::select(-Type) |>
  pivot_wider(names_from = Estimate,
             values_from = `Percentage of students`) |>
  ggplot(aes(x = year, y = Mean)) +
  geom_line(aes(linetype = User)) +
  geom_linerange(aes(ymin = Lower,
                     ymax = Upper), 
                     linewidth = 1, 
              show.legend = FALSE) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_y_continuous(breaks = seq(0, 70, by = 10),
                     labels = seq(0, 70, by = 10),
                     limits = c(0, 70)) +
    theme_linedraw() +
    labs(title = "Tobacco product users more prevalent after 2017",
         y = "% of students") +
    theme(legend.position = "none",
          axis.title.x = element_blank(),
          text = element_text(size = 15),
          plot.title.position = "plot")

Logistic Regression

\[logit(p)= log_e (\frac{p}{1-p})= \beta_0 + \beta_1 X\]

  • \(p\) as the probability that a student respondent is a current e-cigarette user

Model

dat2015 <- nyts_data |>
  filter(year == 2015, !is.na(Sex))

currEcigSex <- logistic_reg() |>
  set_engine("glm") |>
  fit(as.factor(ecig_current) ~ Sex, data = dat2015, family = "binomial")

(currEcigSexTidy <- tidy(currEcigSex))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -2.32     0.0377    -61.5  0       
2 Sexmale        0.425    0.0490      8.66 4.73e-18

\[ \begin{aligned} log(odds \ of \ curr. \ e-cig \ use) = \\ &\quad \beta_0 + \beta_1 \cdot Sex = \\ &\quad -2.32 + 0.425 \cdot (Sex == male) \end{aligned} \]

Model Interpretation

\[ \begin{aligned} log(odds \ of \ curr. \ e-cig \ use) = \\ &\quad \beta_0 + \beta_1 \cdot Sex = \\ &\quad -2.32 + 0.425 \cdot (Sex == male) \end{aligned} \]

  • \(0.425 = \beta_1 = \log(OR)\)
  • The log odds of being a current e-cigarette user is 0.425 higher for males compared to females
  • \(1.53 = e^{0.425} = e^{\beta_1} = OR\)
  • The odds of being a current e-cigarette user for males is 1.53 times the odds for males.
  • The odds of being a current e-cigarette user for males is 53% higher than the odds for females.
Note: model does not consider other variables yet…i.e. Age

Survey-weighted logistic regression

dat2015_survey_design <- dat2015 |>
                          as_survey_design(strata = stratum,
                                            ids = psu,
                                            weight  = finwgt,
                                            nest = TRUE)


currEcigSex_svy <- survey::svyglm(ecig_current ~ Sex,
                                  family = quasibinomial(link = 'logit'),
                                  design = dat2015_survey_design)
tidy(currEcigSex_svy)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -2.28     0.0683    -33.4  1.15e-42
2 Sexmale        0.383    0.0700      5.48 7.52e- 7

Survey-weighted Interpretation

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -2.28     0.0683    -33.4  1.15e-42
2 Sexmale        0.383    0.0700      5.48 7.52e- 7
  • \(0.383 = \beta_1 = \log(OR)\)
  • The log odds of being a current e-cigarette user is 0.383 higher for males than for females, taking survey weights into account.
  • \(1.47 = e^{0.383} = e^{\beta_1} = OR\)
  • The odds of being a current e-cigarette user for males is 1.47 times the odds for males, taking survey weights into account.
  • The odds of being a current e-cigarette user for males is 47% higher than the odds for females, taking survey weights into account.

For the case study…

Q4: Is there a relationship between e-cigarette/vaping use and other tobacco use?

Your Turn: What would we likely model given what we’ve discussed so far?

Suggested Reading